[SPARK-54925][PYTHON] Add the capability to dump threads for pyspark #53705
Conversation
JIRA Issue Information: New Feature SPARK-54925 (this comment was automatically generated by GitHub Actions)
```
flameprof==0.4
viztracer
debugpy
pystack>=1.5.1; python_version!='3.13' and sys_platform=='linux' # no 3.13t wheels
```
Is it only available on Linux?
Unfortunately yes, pystack only supports Linux. The good news is that Linux covers most of our users (and most of our CIs). We can't use it locally on macOS, though.
Python 3.14 comes with a remote exec capability that can do similar things (and supports all platforms). My goal is to use the latest Python-native method when possible and eventually get rid of pystack.
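For reference, a minimal sketch of what the 3.14-native path could look like, assuming PEP 768's `sys.remote_exec(pid, script_path)` as shipped in Python 3.14 (the helper name here is illustrative, not part of this PR):

```python
# Sketch only: requires Python 3.14+, where PEP 768 added sys.remote_exec.
# It asks a running CPython process to execute a script at a safe point.
import sys
import tempfile


def dump_remote_threads(pid: int) -> None:
    # The injected script makes the *target* process dump all of its own
    # thread stacks to its stderr.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write("import faulthandler; faulthandler.dump_traceback(all_threads=True)\n")
        script_path = f.name
    sys.remote_exec(pid, script_path)


if __name__ == "__main__":
    dump_remote_threads(int(sys.argv[1]))
```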
thanks, merged to master
### What changes were proposed in this pull request?
Add an optional capability to dump the thread info of *all* pyspark processes. It is intentionally hidden for now because it is not fully polished. It can be invoked as `python -m pyspark.threaddump -p <pid>` and requires `pystack` and `psutil`; without these libraries the command will fail.
For now it is only used when a test hangs. The result looks like:
```
Thread dump:
Dumping threads for process 1904
Traceback for thread 2175 (python3.12) [] (most recent call last):
(Python) File "/usr/lib/python3.12/threading.py", line 1032, in _bootstrap
self._bootstrap_inner()
(Python) File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
(Python) File "/usr/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
(Python) File "/usr/lib/python3.12/socketserver.py", line 240, in serve_forever
self._handle_request_noblock()
(Python) File "/usr/lib/python3.12/socketserver.py", line 318, in _handle_request_noblock
self.process_request(request, client_address)
(Python) File "/usr/lib/python3.12/socketserver.py", line 349, in process_request
self.finish_request(request, client_address)
(Python) File "/usr/lib/python3.12/socketserver.py", line 362, in finish_request
self.RequestHandlerClass(request, client_address, self)
(Python) File "/usr/lib/python3.12/socketserver.py", line 766, in __init__
self.handle()
(Python) File "/workspaces/spark/python/pyspark/accumulators.py", line 327, in handle
poll(accum_updates)
(Python) File "/workspaces/spark/python/pyspark/accumulators.py", line 281, in poll
for fd, event in poller.poll(1000):
Traceback for thread 2034 (python3.12) [] (most recent call last):
(Python) File "/usr/lib/python3.12/threading.py", line 1032, in _bootstrap
self._bootstrap_inner()
(Python) File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
(Python) File "/workspaces/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/clientserver.py", line 58, in run
Traceback for thread 1904 (python3.12) [] (most recent call last):
(Python) File "<frozen runpy>", line 198, in _run_module_as_main
(Python) File "<frozen runpy>", line 88, in _run_code
(Python) File "/workspaces/spark/python/pyspark/sql/tests/test_udf.py", line 1790, in <module>
unittest.main(testRunner=testRunner, verbosity=2)
(Python) File "/workspaces/spark/python/pyspark/testing/__init__.py", line 30, in unittest_main
res = _unittest_main(*args, **kwargs)
(Python) File "/usr/lib/python3.12/unittest/main.py", line 105, in __init__
self.runTests()
(Python) File "/usr/lib/python3.12/unittest/main.py", line 281, in runTests
self.result = testRunner.run(self.test)
(Python) File "/usr/local/lib/python3.12/dist-packages/xmlrunner/runner.py", line 67, in run
test(result)
(Python) File "/usr/lib/python3.12/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
(Python) File "/usr/lib/python3.12/unittest/suite.py", line 122, in run
test(result)
(Python) File "/usr/lib/python3.12/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
(Python) File "/usr/lib/python3.12/unittest/suite.py", line 122, in run
test(result)
(Python) File "/usr/lib/python3.12/unittest/case.py", line 690, in __call__
return self.run(*args, **kwds)
(Python) File "/usr/lib/python3.12/unittest/case.py", line 634, in run
self._callTestMethod(testMethod)
(Python) File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
if method() is not None:
(Python) File "/workspaces/spark/python/pyspark/sql/tests/test_udf.py", line 212, in test_chained_udf
[row] = self.spark.sql("SELECT double_int(double_int(1) + 1)").collect()
(Python) File "/workspaces/spark/python/pyspark/sql/classic/dataframe.py", line 469, in collect
sock_info = self._jdf.collectToPython()
(Python) File "/workspaces/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1361, in __call__
(Python) File "/workspaces/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1038, in send_command
(Python) File "/workspaces/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/clientserver.py", line 535, in send_command
(Python) File "/usr/lib/python3.12/socket.py", line 720, in readinto
return self._sock.recv_into(b)
Dumping threads for process 2191
Traceback for thread 2191 (python3.12) [] (most recent call last):
(Python) File "<frozen runpy>", line 198, in _run_module_as_main
(Python) File "<frozen runpy>", line 88, in _run_code
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 287, in <module>
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 180, in manager
Dumping threads for process 2198
Traceback for thread 2198 (python3.12) [] (most recent call last):
(Python) File "<frozen runpy>", line 198, in _run_module_as_main
(Python) File "<frozen runpy>", line 88, in _run_code
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 287, in <module>
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 259, in manager
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 88, in worker
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/util.py", line 981, in wrapper
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/worker.py", line 3439, in main
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 583, in read_int
Dumping threads for process 2257
Traceback for thread 2257 (python3.12) [] (most recent call last):
(Python) File "<frozen runpy>", line 198, in _run_module_as_main
(Python) File "<frozen runpy>", line 88, in _run_code
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 287, in <module>
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 259, in manager
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 88, in worker
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/util.py", line 981, in wrapper
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/worker.py", line 3439, in main
(Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 583, in read_int
```
Notice that it covers not only the driver process, but also the daemon and worker processes. The plan is to incorporate this into our existing debug framework so that the `threaddump` button returns both JVM executor threads and Python worker threads. A rough sketch of the per-process dumping appears below.
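The approach can be approximated as follows. This is a sketch rather than the actual `pyspark.threaddump` implementation, assuming `psutil` for walking the process tree and pystack's `pystack remote` CLI for each dump (the function name is illustrative):

```python
# Sketch under assumptions: use psutil to enumerate the process tree rooted
# at the given pid, then shell out to `pystack remote <pid>` for each process.
import subprocess
import sys

import psutil


def dump_process_tree(pid: int) -> None:
    root = psutil.Process(pid)
    for proc in [root, *root.children(recursive=True)]:
        print(f"Dumping threads for process {proc.pid}")
        # pystack attaches to the remote process and prints each thread's stack.
        subprocess.run(["pystack", "remote", str(proc.pid)], check=False)


if __name__ == "__main__":
    dump_process_tree(int(sys.argv[1]))
```

Dumping the whole tree is what makes the daemon and worker processes show up alongside the driver in the output above.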
### Why are the changes needed?
We need insights into the Python workers and daemons.
We have some thread dump capability in our tests, but it is not reliable: `SIGTERM` is sometimes hooked, so `faulthandler` can't work properly, and it can't dump subprocesses.
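For context, the existing mechanism is roughly along these lines (a sketch, with the signal choice taken from the description above): `faulthandler` is registered on a signal and dumps the threads of whichever process receives it.

```python
# Sketch: faulthandler prints every thread's stack of *this* process when
# SIGTERM arrives (Unix only). If application code later installs its own
# SIGTERM handler, the dump never fires; subprocesses are never covered.
import faulthandler
import signal

faulthandler.register(signal.SIGTERM, all_threads=True)
```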
### Does this PR introduce _any_ user-facing change?
Yes, but it's hidden for now. A new command entry is introduced.
### How was this patch tested?
Locally it works.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#53705 from gaogaotiantian/python-threaddump.
Authored-by: Tian Gao <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>